Abstract

Many  fundamental  multi-processor  coordination  problems  can  be  expressed  as 
counting  problems:  processes  must  cooperate  to  assign  successive  values  from  a 
given  range,  such  as  addresses  in  memory  or  destinations  on  an  interconnection 
network.  Conventional  solutions  to  these  problems  perform  poorly  because  of 
synchronization  bottlenecks  and  high  memory  contention. 

Motivated  by  observations  on  the  behavior  of  sorting  networks,  we  offer  a 
completely  new  approach  to  solving  such  problems.  We  introduce  a  new  class  of 
networks called counting networks, i.e., networks that can be used to count. We
give two counting network constructions of depth $\log^2 n$, using $n \log^2 n$ “gates,”
avoiding  the  sequential  bottlenecks  inherent  to  former  solutions,  and  substantially 
lowering  the  memory  contention. 

Finally,  to  show  that  counting  networks  are  not  merely  mathematical  creatures, 
we provide experimental evidence that they outperform conventional synchronization
techniques under a variety of circumstances.
1 Introduction


Many  fundamental  multi-processor  coordination  problems  can  be  expressed  as  counting 
problems:  processors  collectively  assign  successive  values  from  a  given  range,  such  as 
addresses in memory or destinations on an interconnection network. In this paper,
we offer a completely new approach to solving such problems, by introducing counting
networks, a new class of networks that can be used to count.

Counting networks, like sorting networks [2, 5, 7], are constructed from simple
two-input two-output computing elements called balancers, connected to one another by
wires. However, while an n-input sorting network sorts a collection of n input values
only if they arrive together, on separate wires, and propagate through the network in
lockstep, a counting network can count any number $N \gg n$ of input tokens even if they
arrive at arbitrary times, are distributed unevenly among the input wires, and propagate
through the network asynchronously.

Figure 2 provides an example of an execution of a 4-input, 4-output counting network.
A balancer is represented by two dots and a vertical line (see Figure 1). Intuitively,
a balancer is just a toggle mechanism,¹ alternately forwarding inputs to its top and
bottom output wires. It thus balances the number of tokens on its output wires. In the
example of Figure 2, input tokens arrive on the network’s input wires one after the other.
For convenience we have numbered them by the order of their arrival (these numbers are
not used by the network). As can be seen, the first input (numbered 1) enters on line 2
and leaves on line 1, the second leaves on line 2, and in general, the kth token will leave
on line k mod 4. (The reader is encouraged to try this for him/herself.) Thus, if on the
ith output line the network assigns to consecutive outputs the numbers i, i + 4, i + 2·4, ...,
it is counting the number of input tokens without ever passing them all through a shared
computing element!

Counting  networks  achieve  a  high  level  of  throughput  by  decomposing  interactions 
among  processes  into  pieces  that  can  be  performed  in  parallel.  This  decomposition 
has  two  performance  benefits:  It  eliminates  serial  bottlenecks  and  reduces  memory 
contention.  In  practice,  the  performance  of  many  shared-memory  algorithms  is  often 
limited  by  conflicts  at  certain  widely-shared  memory  locations,  often  called  hot  spots 
[25]. Reducing hot-spot conflicts has been the focus of hardware architecture design
[12,  13,  17,  24]  and  experimental  work  in  software  [3,  10,  11,  20,  22]. 

Counting networks are also non-blocking: processes that undergo halting failures
or delays while using a counting network do not prevent other processes from making
progress. This property is important because existing shared-memory architectures are
themselves inherently asynchronous; process step times are subject to timing
uncertainties due to variations in instruction complexity, page faults, cache misses, and
operating system activities such as preemption or swapping.

¹ One can implement a balancer using a read-modify-write operation such as Compare & Swap, or a
short critical section.

Section 2 defines counting networks. In Sections 3 and 4, we give two distinct counting
network constructions, each of depth less than or equal to $\log^2 n$, each using less
than or equal to $(n \log^2 n)/2$ balancers. Section 7 describes how to verify that a given
network counts. To illustrate that counting networks are useful, we use counting networks
to construct high-throughput shared-memory implementations of concurrent data
structures such as shared counters, producer/consumer buffers, and barriers. A shared
counter is simply an object that issues the numbers 0 to m - 1 in response to m requests
by processes. Shared counters are central to a number of shared-memory synchronization
algorithms (e.g., [8, 9, 13, 26]). A producer/consumer buffer is a data structure in
which items inserted by a pool of producer processes are removed by a pool of consumer
processes. A barrier is a data structure that ensures that no process advances beyond a
particular point in a computation until all processes have arrived at that point.
Compared to conventional techniques such as spin locks or semaphores, our counting network
implementations provide higher throughput, less memory contention, and better tolerance
for failures and delays. The implementations can be found in Section 5.

Our  analysis  of  the  counting  network  construction  is  supported  by  experiment.  In 
Section  6,  we  compare  the  performance  of  several  implementations  of  shared  counters, 
producer/consumer buffers, and barrier synchronization on a shared-memory multiprocessor.
When the level of concurrency is sufficiently high, the counting network
implementations outperform conventional implementations based on spin locks, sometimes
dramatically.

In  summary,  counting  networks  represent  a  new  class  of  concurrent  algorithms.  They 
have a rich mathematical structure, they provide effective solutions to important
problems, and they perform well in practice. We believe that counting networks have other
potential uses, for example as interconnection networks [27] or as load balancers [23], and
that  they  deserve  further  attention. 


2  Networks  That  Count 

2.1  Counting  Networks 

Counting networks belong to a larger class of networks called balancing networks,
constructed from wires and computing elements called balancers, in a manner very similar
to  that  in  which  comparison  networks  [7]  are  constructed  from  wires  and  comparators. 
We  begin  by  describing  balancing  networks. 
Figure 1: A Balancer.

A balancer is a computing element with two input wires and two output wires² (see
Figure 1). Tokens arrive on the balancer’s input wires at arbitrary times, and are output
on its output wires. Intuitively, one may think of a balancer as a toggle mechanism that,
given a stream of input tokens, repeatedly sends one token to the left output wire and
one to the right, effectively balancing the number of tokens that have been output on its
output wires. We denote by $x_i$, $i \in \{0, 1\}$, the number of input tokens ever received on
the balancer’s ith input wire, and similarly by $y_i$, $i \in \{0, 1\}$, the number of tokens ever
output on its ith output wire. Throughout the paper we will abuse this notation and
use $x_i$ ($y_i$) both as the name of the ith input (output) wire and a count of the number
of input tokens received on the wire.

Let  the  state  of  a  balancer  at  a  given  time  be  defined  as  the  collection  of  tokens  on 
its  input  and  output  wires.  For  the  sake  of  clarity  we  will  assume  that  tokens  are  all 
distinct. We denote by the pair (t, b) the state transition in which the token t passes
from an input wire to an output wire of the balancer b.

We  can  now  formally  state  the  safety  and  liveness  properties  of  a  balancer: 

1. In any state $x_0 + x_1 \ge y_0 + y_1$ (i.e., a balancer never creates output tokens).

2. Given any finite number of input tokens $m = x_0 + x_1$ to the balancer, it is
guaranteed that within a finite amount of time, it will reach a quiescent state, that is,
one in which the sets of input and output tokens are the same. In any quiescent
state, $x_0 + x_1 = y_0 + y_1 = m$.

3. In any quiescent state, $y_0 = \lceil m/2 \rceil$ and $y_1 = \lfloor m/2 \rfloor$.
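The three properties can be exercised with a small sequential model. The sketch below is illustrative only: the class name `Balancer`, the method `traverse`, and the helper `quiescent_outputs` are assumptions of this example, and the lock-protected toggle bit is just one of the implementation options the footnote mentions (a short critical section).

```python
import threading

class Balancer:
    """Toggle balancer: successive tokens leave on output 0, 1, 0, 1, ..."""

    def __init__(self):
        self._toggle = 0               # output wire for the next token
        self._lock = threading.Lock()  # short critical section around the toggle
        self.y = [0, 0]                # tokens emitted so far on each output wire

    def traverse(self):
        """Pass one token through; return the output wire it leaves on."""
        with self._lock:
            out = self._toggle
            self._toggle = 1 - out
            self.y[out] += 1
        return out

def quiescent_outputs(m):
    """Feed m tokens one at a time and return (y0, y1) once quiescent."""
    b = Balancer()
    for _ in range(m):
        b.traverse()
    return tuple(b.y)
```

For any m, `quiescent_outputs(m)` should equal the pair (⌈m/2⌉, ⌊m/2⌋) required by property 3, regardless of which input wire each token arrived on, which is why the model does not track input wires at all.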

A balancing network of width w is a collection of balancers, where output wires are
connected to input wires, having w designated input wires $x_0, x_1, \ldots, x_{w-1}$ (which are
not connected to output wires of balancers), w designated output wires $y_0, y_1, \ldots, y_{w-1}$
(similarly unconnected), and containing no cycles. Let the state of a network at a given
time be defined as the union of the states of all its component balancers. The safety
and liveness of the network follow naturally from the above network definition and the
properties of balancers, namely, that it is always the case that
$\sum_{i=0}^{w-1} x_i \ge \sum_{i=0}^{w-1} y_i$,
and for any finite sequence of m input tokens, within finite time the network reaches a
quiescent state, i.e., one in which $\sum_{i=0}^{w-1} y_i = m$.

² In Figure 1 as well as in the sequel, we adopt the notation of [7] and draw wires as horizontal
lines with balancers stretched vertically.

Figure 2: A sequential execution for a Bitonic[4] counting network.

It is important to note that we make no assumptions about the timing of token
transitions from balancer to balancer in the network: the network’s behavior is completely
asynchronous. Although balancer transitions can occur concurrently, it is convenient to
model them using an interleaving semantics in the style of Lynch and Tuttle [19]. An
execution of a network is a finite sequence $s_0, e_1, s_1, \ldots, e_n, s_n$ or infinite sequence
$s_0, e_1, s_1, \ldots$ of alternating states and balancer transitions such that for each
$(s_i, e_{i+1}, s_{i+1})$, the transition $e_{i+1}$ carries state $s_i$ to $s_{i+1}$. A schedule is the
subsequence of transitions occurring in an execution. A schedule is valid if it is induced
by some execution, and complete if it is induced by an execution which results in a
quiescent state. A schedule s is sequential if for any two transitions $e_i = (t_i, b_i)$ and
$e_j = (t_j, b_j)$, where $t_i$ and $t_j$ are the same token, all transitions between them also
involve that token.

On  a  shared  memory  multiprocessor,  a  balancing  network  is  implemented  as  a  shared 
data  structure,  where  balancers  are  records,  and  wires  are  pointers  from  one  record  to 
another.  Each  of  the  machine’s  asynchronous  processors  runs  a  program  that  repeatedly 
traverses  the  data  structure  from  some  input  pointer  to  some  output  pointer,  each  time 
shepherding a new token through the network (see Section 5).
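As a rough illustration of "balancers are records, wires are pointers", the sketch below wires six records into a width-4 network and walks tokens through it. The names `BalancerRecord`, `build_bitonic4`, and `shepherd` are hypothetical, and the wiring is our reading of the Bitonic[4] construction given in Section 3, not code from the paper. The toggle update here is deliberately not atomic, so the sketch covers sequential executions only.

```python
class BalancerRecord:
    """A balancer stored as a record: a toggle bit plus two outgoing wires."""

    def __init__(self):
        self.toggle = 0
        self.out = [None, None]  # each entry: another record, or an int output wire

def build_bitonic4():
    """Wire six balancers as a width-4 network; return the four input pointers."""
    b1, b2, m1, m2, f0, f1 = (BalancerRecord() for _ in range(6))
    b1.out = [m1, m2]   # first layer: one balancer per pair of input wires
    b2.out = [m2, m1]
    m1.out = [f0, f1]   # middle layer feeds the final layer
    m2.out = [f0, f1]
    f0.out = [0, 1]     # integers name the network's output wires
    f1.out = [2, 3]
    return [b1, b1, b2, b2]

def shepherd(entry):
    """Walk one token from an input pointer to an output wire."""
    node = entry
    while isinstance(node, BalancerRecord):
        t = node.toggle
        node.toggle = 1 - t   # not atomic: sequential use only
        node = node.out[t]
    return node
```

In a sequential run, the k-th token (counting from 0) should come out on wire k mod 4 whichever input wire it entered on, for example `[shepherd(inputs[i]) for i in (2, 3, 0, 1)]` gives `[0, 1, 2, 3]` on a fresh network.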

We  define  the  depth  of  a  balancing  network  to  be  the  maximal  depth  of  any  wire, 
where  the  depth  of  a  wire  is  defined  as  0  for  a  network  input  wire,  and 

max(depth($x_0$), depth($x_1$)) + 1

for the output wires of a balancer having input wires $x_0$ and $x_1$. We can thus formulate
the  following  straightforward  yet  useful  lemma: 
Lemma 2.1 If the transition of a token from input to output by any balancer takes at
most Δ time, then any input token will exit the network within time at most Δ times
the network depth.
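The depth definition translates directly into a short computation. The sketch below assumes a representation that is ours, not the paper's: a network is a list of balancers, each given as a pair of wire positions and listed in an order consistent with the wiring, with each balancer's outputs staying on the same two positions (as in the horizontal-wire drawings). The function names are hypothetical.

```python
def network_depth(width, balancers):
    """Depth of a balancing network given as (wire_a, wire_b) pairs."""
    depth = [0] * width                  # network input wires have depth 0
    for a, b in balancers:
        d = max(depth[a], depth[b]) + 1  # depth of both output wires
        depth[a] = depth[b] = d
    return max(depth)

def exit_time_bound(width, balancers, delta):
    """Bound of Lemma 2.1: at most delta per balancer times the network depth."""
    return delta * network_depth(width, balancers)
```

For instance, a width-4 network with three layers of two balancers each has depth 3, so with Δ = 2 time units per balancer the bound on exit time is 6.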

A counting network of width w is a balancing network whose outputs $y_0, \ldots, y_{w-1}$
satisfy the following step property:

In any quiescent state, $0 \le y_i - y_j \le 1$ for any $i < j$.

To  illustrate  this  property,  consider  an  execution  in  which  tokens  traverse  the  network 
sequentially,  one  completely  after  the  other.  Figure  2  shows  such  an  execution  on  a 
Bitonic[4]  counting  network  which  we  will  define  formally  in  Section  3.  As  can  be 
seen, the network moves input tokens to output wires in increasing order modulo w.
Balancing networks having this property are called counting networks because they can
easily be adapted to count the total number of tokens that have entered the network.
Counting is done by adding a “local counter” to each output wire i, so that tokens
coming out of that wire are consecutively assigned numbers $i, i + w, \ldots, i + (y_i - 1)w$.
(This  application  is  described  in  greater  detail  in  Section  5.) 
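A minimal sketch of the local-counter idea follows, under several assumptions of ours: the network is a width-4 network written as a lookup table (our reading of Bitonic[4] from Section 3), every balancer and every local counter is guarded by its own lock, and the names `WIRING`, `ENTRY`, `NetworkCounter`, and `count_concurrently` are hypothetical. It is meant to show the mechanism, not to reproduce the implementation of Section 5.

```python
import threading

W = 4
# Assumed table form of a width-4 counting network: balancer -> (top, bottom)
# successor. Strings name balancers; integers name output wires.
WIRING = {
    "b1": ("m1", "m2"), "b2": ("m2", "m1"),
    "m1": ("f0", "f1"), "m2": ("f0", "f1"),
    "f0": (0, 1), "f1": (2, 3),
}
ENTRY = ["b1", "b1", "b2", "b2"]      # balancer reached from each input wire

class NetworkCounter:
    def __init__(self):
        self.toggle = {name: 0 for name in WIRING}
        self.locks = {name: threading.Lock() for name in WIRING}
        self.local = list(range(W))   # output wire i hands out i, i + W, ...
        self.local_locks = [threading.Lock() for _ in range(W)]

    def next_value(self, input_wire):
        node = ENTRY[input_wire]
        while not isinstance(node, int):
            with self.locks[node]:    # one short critical section per balancer
                t = self.toggle[node]
                self.toggle[node] = 1 - t
            node = WIRING[node][t]
        with self.local_locks[node]:  # local counter on the output wire
            value = self.local[node]
            self.local[node] += W
        return value

def count_concurrently(threads=8, requests=50):
    """Run several threads against one counter; return all values, sorted."""
    counter, results = NetworkCounter(), []

    def worker(tid):
        got = [counter.next_value(tid % W) for _ in range(requests)]
        results.append(got)

    pool = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sorted(v for got in results for v in got)
```

Once all threads have finished the network is quiescent, so by the step property the values handed out should be exactly 0 through m - 1, with no gaps or duplicates, even though no single lock was shared by all requests.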

The step property can be defined in a number of ways which we will use interchangeably.
The connection between them is stated in the following lemma:

Lemma 2.2 If $y_0, \ldots, y_{w-1}$ is a sequence of non-negative integers, the following
statements are all equivalent:

1. For any $i < j$, $0 \le y_i - y_j \le 1$.

2. Either $y_i = y_j$ for all i, j, or there exists some c such that for any $i < c$ and $j \ge c$,
$y_i - y_j = 1$.

3. If $m = \sum_{i=0}^{w-1} y_i$, then $y_i = \lceil (m - i)/w \rceil$.

It is the third form of the step property that makes counting networks usable for
counting.
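The equivalence can be sanity-checked by brute force on small sequences. The sketch below is only a check, not a proof; the function names are ours, and form 2 is read with c ranging over 1, ..., w - 1 so that both sides of the split are non-empty.

```python
from itertools import product

def form1(y):
    """0 <= y_i - y_j <= 1 for every i < j."""
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))

def form2(y):
    """All entries equal, or a split point c with y_i - y_j = 1 across it."""
    w = len(y)
    if all(v == y[0] for v in y):
        return True
    return any(all(y[i] - y[j] == 1 for i in range(c) for j in range(c, w))
               for c in range(1, w))

def form3(y):
    """y_i = ceil((m - i) / w) where m is the total."""
    w, m = len(y), sum(y)
    return all(y[i] == -((i - m) // w) for i in range(w))

def forms_agree(max_width=4, max_value=3):
    """True if the three forms agree on every small sequence."""
    for w in range(1, max_width + 1):
        for y in product(range(max_value + 1), repeat=w):
            if not (form1(y) == form2(y) == form3(y)):
                return False
    return True
```

Form 3 is the one a counter uses in practice: given only the total m, it says exactly how many tokens have left on each wire.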

Proof: We will prove that 1 implies 2, 2 implies 3, and 3 implies 1.

Assume 1 holds for the sequence $y_0, \ldots, y_{w-1}$. If for every $0 \le i < j < w$, $y_i - y_j = 0$,
then 2 follows. Otherwise, there exists the largest a such that there is a b for which
$a < b$ and $y_a - y_b = 1$. From a’s being largest we get that $y_a - y_{a+1} = 1$, and from 1 we
get $y_i = y_a$ for any $0 \le i \le a$ and $y_i = y_{a+1}$ for any $a + 1 \le i < w$. Choosing $c = a + 1$
completes the proof. Thus 1 implies 2.

Assume by way of contradiction that 3 does not hold and 2 does. Without loss of
generality, there thus exists the smallest a such that $m = \sum_{i=0}^{w-1} y_i$ and
$y_a \ne \lceil (m - a)/w \rceil$. If $y_a < \lceil (m - a)/w \rceil$, then since $\sum_{i=0}^{w-1} y_i = m$, by simple
arithmetic there must exist a $b \ne a$ such that $y_b > \lceil (m - b)/w \rceil$, and similarly if
$y_a > \lceil (m - a)/w \rceil$, there exists a $b \ne a$ such that $y_b < \lceil (m - b)/w \rceil$.
Since $|y_b - y_a| \ge 2$, no c as in 2 exists, a contradiction. Thus 2 implies 3.

Finally, for any indexes $a < b$, since $0 \le a < b < w$, it must be that
$0 \le \lceil (m - a)/w \rceil - \lceil (m - b)/w \rceil \le 1$. Thus 3 implies 1. ■

The requirement that a quiescent counting network’s outputs have the step property
might appear to tell us little about the behavior of a counting network during an
asynchronous  execution,  but  in  fact  it  is  surprisingly  powerful.  Even  in  a  state  in  which 
many  tokens  are  passing  through  the  network,  the  network  must  eventually  settle  into 
a  quiescent  state  if  no  new  tokens  enter  the  network.  This  constraint  makes  it  possible 
to  prove  such  important  properties  as  the  following: 

Lemma 2.3 Suppose that in a given execution a counting network with output sequence
$y_0, \ldots, y_{w-1}$ is in a state where m tokens have entered the network and m' tokens have
left it. Then there exist non-negative integers $d_i$, $0 \le i < w$, such that
$\sum_{i=0}^{w-1} d_i = m - m'$ and $y_i + d_i = \lceil (m - i)/w \rceil$.

Proof: Suppose not. There is some execution e for which the non-negative integers $d_i$,
$0 \le i < w$, do not exist. If we extend e to a complete execution e' allowing no additional
tokens to enter the network, then at the end of e' the network will be in a quiescent
state where the step property does not hold, a contradiction. ■

In  a  sequential  execution,  where  tokens  traverse  the  network  one  at  a  time,  the 
network  is  quiescent  every  time  a  token  leaves.  In  this  case  the  i-th  token  to  enter  will 
leave  on  output  i  mod  w.  The  lemma  shows  that  in  a  concurrent,  asynchronous  execution 
of  any  counting  network,  any  “gap”  in  this  sequence  of  mod  w  counts  corresponds  to 
tokens  still  traversing  the  network.  This  critical  property  holds  in  any  execution,  even  if 
quiescent  states  never  occur,  and  even  though  the  definition  makes  no  explicit  reference 
to  non-quiescent  states. 
Figure 3: Recursive Structure of a Bitonic[8] Counting Network.

2.2  Counting  vs.  Sorting 

A balancing network and a comparison network are isomorphic if one can be constructed
from the other by replacing balancers by comparators or vice versa. The counting
networks introduced in this paper are isomorphic to the Bitonic sorting network of
Batcher [5] and to the Periodic Balanced sorting network of Dowd, Perl, Rudolph, and
Saks [6]. There is a sense in which constructing counting networks is “harder” than
constructing sorting networks:

Theorem 2.4 If a balancing network counts, then its isomorphic comparison network
sorts, but not vice versa.

Proof: It is easy to verify that balancing networks isomorphic to the Even-Odd or
Insertion sorting networks [7] are not counting networks.

For the other direction, we construct a mapping from the comparison network
transitions to the isomorphic balancing network transitions.

By the 0-1 principle [7], a comparison network which sorts all sequences of 0’s and
1’s is a sorting network. Take any arbitrary sequence of 0’s and 1’s as inputs to the
comparison network, and for the balancing network place a token on each 0 input wire
and no token on each 1 input wire. We now show that if we run both networks in
lockstep, the balancing network will simulate the comparison network.

On every gate where two 0’s meet in the comparison network, two tokens meet in
the balancing network, so two 0’s leave, one on each wire, in the comparison network, and
both tokens leave in the balancing network. On every gate where two 1’s meet in the
comparison network, no tokens meet in the balancing network, so two 1’s leave, one on each
wire, in the comparison network, and no tokens leave in the balancing network. On every
gate where a 0 and 1 meet in the comparison network, the 0 leaves on the lower wire and
the 1 on the upper wire, while in the balancing network the token leaves on the lower
wire, and no token on the upper wire.

If the balancing network is a counting network, i.e., it has the step property, then
the comparison network must have sorted the input sequence of 0’s and 1’s. ■
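The "not vice versa" direction can be seen on a very small case. The three-gate, width-3 network below is an insertion-style sorting network chosen by us for illustration (the proof itself only names the Even-Odd and Insertion families); the function names are hypothetical. Balancers are evaluated in the quiescent state, where each one splits its total input into a ceiling and a floor.

```python
from itertools import product

GATES = [(0, 1), (1, 2), (0, 1)]   # each gate joins two wire positions

def run_comparators(values):
    """Treat every gate as a comparator: smaller value to the first wire."""
    v = list(values)
    for a, b in GATES:
        if v[a] > v[b]:
            v[a], v[b] = v[b], v[a]
    return v

def run_balancers(tokens):
    """Treat every gate as a balancer; return quiescent output counts."""
    y = list(tokens)
    for a, b in GATES:
        s = y[a] + y[b]
        y[a], y[b] = (s + 1) // 2, s // 2   # ceiling and floor of the total
    return y

def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))

def sorts_all_zero_one_inputs():
    """0-1 principle: sorting every 0/1 input implies sorting every input."""
    return all(run_comparators(bits) == sorted(bits)
               for bits in product((0, 1), repeat=len(GATES)))
```

As comparators the gates sort every 0/1 input, so by the 0-1 principle they form a sorting network. As balancers, three tokens entering on the first wire settle as `run_balancers([3, 0, 0]) == [2, 1, 0]`, which violates the step property (a counting network would give `[1, 1, 1]`).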
Corollary 2.5 The depth of any counting network is at least $\Omega(\log n)$.


3  A  Bitonic  Counting  Network 

Counting  networks,  of  course,  would  not  be  interesting  if  we  could  not  exhibit  examples  of 
constructible  networks.  In  this  section  we  describe  how  to  construct  a  counting  network 
whose width is any power of 2. The layout of this network is isomorphic to Batcher's
famous  Bitonic  sorting  network  [5,  7],  though  its  behavior  and  correctness  arguments 
are  completely  different.  We  give  an  inductive  construction,  as  this  will  later  aid  us  in 
proving  its  correctness. 

Define the width w balancing network Merger[w] as follows. It has two sequences
of inputs of length w/2, x and x', and a single sequence of outputs y, of length w.
Merger[w] will be constructed to guarantee that in a quiescent state where the
sequences x and x' have the step property, y will also have the step property, a fact which
will be proved in the next section.

We define the network Merger[w] inductively (see example in Figure 4). Since
w is a power of 2, we will repeatedly use the notation 2k in place of w. When k is
equal to 1, the Merger[2k] network consists of a single balancer. For k > 1, we
construct the Merger[2k] network from two Merger[k] networks and k balancers.
Using a Merger[k] network we merge the even subsequence $x_0, x_2, \ldots, x_{k-2}$ of x with
the odd subsequence $x'_1, x'_3, \ldots, x'_{k-1}$ of x' (i.e., the sequence
$x_0, \ldots, x_{k-2}, x'_1, \ldots, x'_{k-1}$ is the input to the Merger[k] network), while with a
second Merger[k] network we merge the odd subsequence of x with the even subsequence
of x'. Call the outputs of these two Merger[k] networks z and z'. The final stage of the
network combines z and z' by sending each pair of wires $z_i$ and $z'_i$ into a balancer whose
outputs yield $y_{2i}$ and $y_{2i+1}$.

The Merger[w] network consists of log w layers of w/2 balancers each. Merger[w]
guarantees the step property on its outputs only when its inputs also have the step
property, but we can ensure this property by filtering these inputs through smaller
counting networks. We define Bitonic[w] to be the network constructed by passing
the outputs from two Bitonic[w/2] networks into a Merger[w] network, where the
induction is grounded in the Bitonic[1] network which contains no balancers and simply
passes its input directly to its output. This construction gives us a network consisting
of $\binom{\log w + 1}{2}$ layers each consisting of w/2 balancers.
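The recursive definitions of Merger and Bitonic can be sketched as functions on token counts. The code below is a quiescent-state model under our own assumptions: it takes the number of tokens that entered on each input wire and returns the number that have left on each output wire once the network is quiescent, using only the fact that a quiescent balancer splits its total input into a ceiling and a floor. The function names are ours.

```python
def balancer(a, b):
    """Quiescent outputs of one balancer whose inputs carried a and b tokens."""
    s = a + b
    return (s + 1) // 2, s // 2

def merger(x, xp):
    """Merger[2k] on input sequences x and x' (each of length k)."""
    k = len(x)
    if k == 1:
        return list(balancer(x[0], xp[0]))
    z = merger(x[0::2], xp[1::2])    # even subsequence of x, odd of x'
    zp = merger(x[1::2], xp[0::2])   # odd subsequence of x, even of x'
    y = []
    for a, b in zip(z, zp):          # final layer: y_2i, y_2i+1 from z_i, z'_i
        y.extend(balancer(a, b))
    return y

def bitonic(x):
    """Bitonic[w] for w = len(x), a power of 2."""
    w = len(x)
    if w == 1:
        return list(x)               # Bitonic[1] passes its input through
    return merger(bitonic(x[:w // 2]), bitonic(x[w // 2:]))

def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))
```

For example, three tokens all entering on the first wire of the width-4 network give `bitonic([3, 0, 0, 0]) == [1, 1, 1, 0]`. Checking `has_step_property(bitonic(x))` over many small inputs is a quick way to catch a wiring mistake, though it does not replace the proof of correctness.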
Figure 4: A Merger[8] balancing network.

3.1  Proof  of  Correctness 

In this section we show that Bitonic[w] is a counting network. Before examining the
network itself, we present some simple lemmas about sequences having the step property.

Lemma  3.1  If  a  sequence  has  the  step  property,  then  so  do  all  its  subsequences. 


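Lemma 3.2 If $x_0, \ldots, x_{k-1}$ has the step property, then its even and odd subsequences
satisfy:

$\sum_{i=0}^{k/2-1} x_{2i} = \lceil \sum_{i=0}^{k-1} x_i / 2 \rceil$ and $\sum_{i=0}^{k/2-1} x_{2i+1} = \lfloor \sum_{i=0}^{k-1} x_i / 2 \rfloor$.

Proof: Either $x_{2i} = x_{2i+1}$ for $0 \le i < k/2$, or by Lemma 2.2 there exists a unique
j such that $x_{2j} = x_{2j+1} + 1$ and $x_{2i} = x_{2i+1}$ for all $i \ne j$, $0 \le i < k/2$. In the
first case, $\sum x_{2i} = \sum x_{2i+1} = \sum x_i / 2$, and in the second case
$\sum x_{2i} = \lceil \sum x_i / 2 \rceil$ and $\sum x_{2i+1} = \lfloor \sum x_i / 2 \rfloor$. ■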


Lemma 3.3 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having the step
property. If $\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i$, then $x_i = y_i$ for all $0 \le i < k$.

Proof: Let $m = \sum x_i = \sum y_i$. By Lemma 2.2, $x_i = y_i = \lceil (m - i)/k \rceil$. ■


Lemma 3.4 Let $x_0, \ldots, x_{k-1}$ and $y_0, \ldots, y_{k-1}$ be arbitrary sequences having the step
property. If $\sum_{i=0}^{k-1} x_i = \sum_{i=0}^{k-1} y_i + 1$, then there exists a unique j, $0 \le j < k$, such that
$x_j = y_j + 1$, and $x_i = y_i$ for $i \ne j$, $0 \le i < k$.

Proof: Let $m = \sum x_i = \sum y_i + 1$. By Lemma 2.2, $x_i = \lceil (m - i)/k \rceil$ and
$y_i = \lceil (m - 1 - i)/k \rceil$.
These two terms agree for all i, $0 \le i < k$, except for the unique i such that $i = m - 1$
(mod k). ■
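Because a step sequence is determined by its length and its total (form 3 of Lemma 2.2), Lemmas 3.2 and 3.4 can be spot-checked numerically. The helper names below are ours, and the check is only a sanity test over small values, not a substitute for the proofs.

```python
def step_sequence(m, k):
    """The length-k sequence with the step property and total m."""
    return [-((i - m) // k) for i in range(k)]   # ceil((m - i) / k)

def check_lemma_3_2(m, k):
    """Even entries sum to ceil(m/2), odd entries to floor(m/2) (k even)."""
    x = step_sequence(m, k)
    return sum(x[0::2]) == (m + 1) // 2 and sum(x[1::2]) == m // 2

def differing_indices(m, k):
    """Indices where step sequences with totals m and m - 1 disagree."""
    x, y = step_sequence(m, k), step_sequence(m - 1, k)
    return [i for i in range(k) if x[i] != y[i]]
```

With k = 4 and m = 5 the sequence is `[2, 1, 1, 1]`, the even entries sum to 3 and the odd entries to 2, and it differs from the m = 4 sequence `[1, 1, 1, 1]` only at index (m - 1) mod k = 0.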

We now show that the Merger[w] network preserves the step property.

Lemma 3.5 If Merger[2k] is quiescent, and its inputs $x_0, \ldots, x_{k-1}$ and $x'_0, \ldots, x'_{k-1}$
both have the step property, then its outputs $y_0, \ldots, y_{2k-1}$ have the step property.

Proof: We argue by induction on log k.

If 2k = 2, Merger[2k] is just a balancer, so its outputs are guaranteed to have the
step property by the definition of a balancer.

If 2k > 2, let $z_0, \ldots, z_{k-1}$ be the outputs of the first Merger[k] subnetwork, which
merges the even subsequence of x with the odd subsequence of x', and let $z'_0, \ldots, z'_{k-1}$
be the outputs of the second. Since x and x' have the step property by assumption, so
do their even and odd subsequences (Lemma 3.1), and hence so do z and z' (induction
hypothesis). Furthermore, $\sum z_i = \lceil \sum x_i / 2 \rceil + \lfloor \sum x'_i / 2 \rfloor$ and
$\sum z'_i = \lfloor \sum x_i / 2 \rfloor + \lceil \sum x'_i / 2 \rceil$
(Lemma 3.2). A straightforward case analysis shows that $\sum z_i$ and $\sum z'_i$ can differ by at
most 1.

We claim that $0 \le y_i - y_j \le 1$ for any $i < j$. If $\sum z_i = \sum z'_i$, then Lemma 3.3 implies
that $z_i = z'_i$ for $0 \le i < k$. After the final layer of balancers,

$y_i - y_j = z_{\lfloor i/2 \rfloor} - z_{\lfloor j/2 \rfloor}$,

and the result follows because z has the step property. Similarly, if $\sum z_i$ and $\sum z'_i$ differ
by one, Lemma 3.4 implies that $z_i = z'_i$ for $0 \le i < k$, except for a unique j such that
$z_j$ and $z'_j$ differ by one. The difference $y_i - y_j$ for any $i < j$ can be expressed as
the difference between earlier and later terms either of z or of z', and the result follows
because these two sequences both have the step property. ■


The  proof  of  the  following  theorem  is  now  immediate. 


Theorem 3.6 In any quiescent state, the outputs of Bitonic[w] have the step property.
4 A Periodic Counting Network


In  this  section  we  show  that  the  bitonic  network  is  not  the  only  counting  network  with 
depth $O(\log^2 n)$. We introduce a new counting network with the interesting property
that  it  is  periodic,  consisting  of  a  sequence  of  identical  subnetworks.  Each  stage  of  this 
periodic  network  is  interesting  in  its  own  right,  since  it  can  be  used  to  achieve  barrier 
synchronization  with  low  contention.  This  counting  network  is  isomorphic  to  the  elegant 
balanced  periodic  sorting  network  of  Dowd,  Perl,  Rudolph,  and  Saks  [6].  However,  its 
behavior,  and  therefore  also  our  proof  of  correctness,  are  fundamentally  different. 

We start by defining chains and cochains, notions taken from [6]. Given a sequence
$x = \{x_i \mid i = 0, \ldots, n - 1\}$, it is convenient to represent each index (subscript) as a
binary string. The level i chain of x is the subsequence of x whose indices have the
same i low-order bits. For example, the subsequence $x^E$ of entries with even indices is a
level 1 chain, as is the subsequence $x^O$ of entries with odd indices. The A-cochain of x,
denoted $x^A$, is the subsequence whose indices have the two low-order bits 00 or 11. For
example, the A-cochain of the sequence $x_0, \ldots, x_7$ is $x_0, x_3, x_4, x_7$. The B-cochain $x^B$ is
the subsequence whose low-order bits are 01 and 10.

Define the network Block[k] as follows. When k is equal to 2, the Block[k] network
consists of a single balancer. The Block[2k] network for larger k is constructed
recursively. We start with two Block[k] networks A and B. Given an input sequence
x, the input to A is $x^A$, and the input to B is $x^B$. Let y be the output sequence for the
two subnetworks, where $y^A$ is the output sequence for A and $y^B$ the output sequence for
B. The final stage of the network combines each $y_i^A$ and $y_i^B$ in a single balancer, yielding
final outputs $z_{2i}$ and $z_{2i+1}$. Figure 5 describes the recursive construction of a Block[8]
network. The Periodic[2k] network consists of log 2k Block[2k] networks joined so
that the ith output wire of one is the ith wire of the next. Figure 6 is a Periodic[8]
counting network.³

This  recursive  construction  is  quite  different  from  the  one  used  by  Dowd  et  al.  We 
chose  this  construction  because  it  yields  a  substantially  simpler  and  shorter  proof  of 
correctness. 
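The Block and Periodic constructions can be modeled in the same quiescent-state style: token counts in, token counts out, with each balancer splitting its total into a ceiling and a floor. The sketch below reflects our reading of the construction: cochains are taken in increasing index order, and a network of width w is built from log w blocks, matching the three blocks of Periodic[8]. The function names are ours.

```python
def balancer(a, b):
    """Quiescent outputs of one balancer whose inputs carried a and b tokens."""
    s = a + b
    return (s + 1) // 2, s // 2

def block(x):
    """Block[n] for n = len(x), a power of 2 with n >= 2."""
    n = len(x)
    if n == 2:
        return list(balancer(x[0], x[1]))
    xa = [x[i] for i in range(n) if i % 4 in (0, 3)]  # A-cochain: low bits 00, 11
    xb = [x[i] for i in range(n) if i % 4 in (1, 2)]  # B-cochain: low bits 01, 10
    ya, yb = block(xa), block(xb)
    z = []
    for a, b in zip(ya, yb):       # final stage: z_2i, z_2i+1 from y_i^A, y_i^B
        z.extend(balancer(a, b))
    return z

def periodic(x):
    """Periodic network of width len(x): log2(width) identical blocks in series."""
    rounds = len(x).bit_length() - 1
    y = list(x)
    for _ in range(rounds):
        y = block(y)
    return y

def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))
```

For example, `periodic([0, 0, 0, 5])` should settle to `[2, 1, 1, 1]`. An exhaustive check over small inputs is a useful guard against indexing slips in the cochain selection.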

4.1  Proof  of  Correctness 

In  the  proof  we  use  the  technical  lemmas  about  input  and  output  sequences  presented 
in  Section  3.  The  following  lemma  will  serve  a  key  role  in  the  inductive  proof  of  our 
construction: 

³ Despite the apparent similarities between the layouts of the Block and Merger networks, there
is  no  permutation  of  wires  that  yields  one  from  the  other. 
Figure 5: A Block[8] balancing network.

Lemma 4.1 For i > 1,

1. The level i chain of x is a level i - 1 chain of one of x’s cochains.

2. The level i chain of a cochain of x is a level i + 1 chain of x.

Proof:  Follows  immediately  from  the  definitions  of  chains  and  cochains.  ■ 

As  will  be  seen,  the  price  of  modularity  is  redundancy,  that  is,  balancers  in  lower  level 
blocks  will  be  applied  to  sub-sequences  that  already  have  the  desired  step  property.  We 
therefore  present  the  following  lemma  that  amounts  to  saying  that  applying  balancers 
“evenly”  to  such  sequences  does  not  hurt: 

Lemma 4.2 If x and x' are sequences each having the step property, and pairs $x_i$ and
$x'_i$ are routed through a balancer, yielding outputs $y_i$ and $y'_i$, then the sequences y and y'
each have the step property.

Proof: For any $i < j$, given that x and x' have the step property, $0 \le x_i - x_j \le 1$ and
$0 \le x'_i - x'_j \le 1$, and therefore the difference between any two wires is
$0 \le x_i + x'_i - (x_j + x'_j) \le 2$. By definition, for any i, $y_i = \lceil (x_i + x'_i)/2 \rceil$ and
$y'_i = \lfloor (x_i + x'_i)/2 \rfloor$, and so for any $i < j$, it
is the case that $0 \le y_i - y_j \le 1$ and $0 \le y'_i - y'_j \le 1$, implying the step property. ■
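Lemma 4.2 is easy to spot-check numerically. In the sketch below (helper names are ours), two step sequences are generated from their totals, paired wire by wire through quiescent balancers, and both output sequences are tested for the step property.

```python
def step_sequence(m, k):
    """The length-k sequence with the step property and total m."""
    return [-((i - m) // k) for i in range(k)]   # ceil((m - i) / k)

def has_step_property(y):
    return all(0 <= y[i] - y[j] <= 1
               for i in range(len(y)) for j in range(i + 1, len(y)))

def balance_pairwise(x, xp):
    """Route each pair (x_i, x'_i) through a balancer; return (y, y')."""
    y = [(a + b + 1) // 2 for a, b in zip(x, xp)]   # ceiling outputs
    yp = [(a + b) // 2 for a, b in zip(x, xp)]      # floor outputs
    return y, yp
```

For instance, pairing `[2, 1, 1, 1]` with `[1, 1, 1, 0]` gives `[2, 1, 1, 1]` and `[1, 1, 1, 0]` again, which is the sense in which applying balancers "evenly" to sequences that already have the step property does no harm.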

To prove the correctness of our construction for Periodic[k], we will show that if a
block's level $i$ input chains have the step property, then so do its level $i - 1$ output chains,
for $i$ in $\{0, \ldots, \log k - 1\}$. This observation implies that a sequence of $\log k$ Block[k]
networks will count an arbitrary number of inputs.

Lemma 4.3 Let Block[2k] be quiescent with input sequence $x$ and output sequence $y$.
If $x^E$ and $x^O$ both have the step property, so does $y$.

Proof: We argue by induction on $\log k$. The proof is similar to that of Lemma 3.5.

For the base case, when $2k = 2$, Block[2k] is just a balancer, so its outputs are
guaranteed to have the step property by the definition of a balancer.

For the induction step, assume the result for Block[k] and consider a Block[2k].
Let $x$ be the input sequence to the block, $z$ the output sequence of the nested blocks A
and B, and $y$ the block's final output sequence. The inputs to A are the level 2 chains
$x^{EE}$ and $x^{OO}$, and the inputs to B are $x^{EO}$ and $x^{OE}$. By Lemma 4.1, each of these is a
level 1 chain of $x^A$ and $x^B$. These sequences are the inputs to A and B, themselves of
size $k$, so the induction hypothesis implies that the outputs $z^A$ and $z^B$ of A and B each
have the step property.

Lemma 3.2 implies that $0 \le \sum x^{EE}_i - \sum x^{EO}_i \le 1$ and $0 \le \sum x^{OE}_i - \sum x^{OO}_i \le 1$.
It follows that the sum of A's inputs, $\sum x^{EE}_i + \sum x^{OO}_i$, and the sum of B's inputs,
$\sum x^{EO}_i + \sum x^{OE}_i$, differ by at most 1. Since balancers do not swallow or create tokens,
$\sum z^A_i$ and $\sum z^B_i$ also differ by at most 1. If they are equal, then Lemma 3.3 implies that
$z^A_i = z^B_i = z_{2i} = z_{2i+1}$. For $i < j$,

$$y_i - y_j = z^A_{\lfloor i/2 \rfloor} - z^A_{\lfloor j/2 \rfloor}$$

and the result follows because $z^A$ has the step property.

Similarly, if $\sum z^A_i$ and $\sum z^B_i$ differ by one, Lemma 3.4 implies that $z^A_i = z^B_i$ for
$0 \le i < k$, except for a unique $\ell$ such that $z^A_\ell$ and $z^B_\ell$ differ by one. If $i < j$ and $i \ne 2\ell$
and $j \ne 2\ell + 1$, then $y_i - y_j$ is equal to the difference between earlier and later terms of
either $z^A$ or $z^B$, and the result follows because the latter have the step property. Finally,
since $z^A_\ell$ and $z^B_\ell$ are joined by a balancer in the last layer, $y_{2\ell} - y_{2\ell+1} = 1$ and the result
is established. ■

Theorem 4.4 Let Block[2k] be quiescent with input sequence $x$ and output sequence
$y$. If all the level $i$ input chains to a block have the step property, then so do all the level
$i - 1$ output chains.

Proof:  We  argue  by  induction  on  i.  Lemma  4.3  provides  the  base  case,  when  i  is  1. 

For the induction step, assume the result for chains up to $i - 1$. Let $x$ be the input
sequence to the block, $z$ the output sequence of the nested blocks A and B, and $y$ the
block's final output sequence. If $i > 1$, Lemma 4.1 implies that every level $i$ chain of $x$
is entirely contained in one cochain or the other. Each level $i$ chain of $x$ contained in $x^A$
($x^B$) is a level $i - 1$ chain of $x^A$ ($x^B$), each has the step property, and each is an input to
A (B). The induction hypothesis applied to A and B implies that the level $i - 2$ chains
of $z^A$ and $z^B$ have the step property. Lemma 4.1 implies that the level $i - 2$ chains
of $z^A$ and $z^B$ are the level $i - 1$ chains of $z$. By Lemma 4.2, if the level $i - 1$ chains of $z$
have the step property, so do the level $i - 1$ chains of $y$. ■

Figure 6: A Periodic[8] counting network.


By  Theorem  2.4,  the  proof  of  Theorem  4.4  constitutes  a  simple  alternative  proof  that 
the  balanced  periodic  comparison  network  of  [6]  is  a  sorting  network. 
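The claim that a sequence of $\log k$ Block[k] networks counts can be explored on small widths by simulating quiescent token counts. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes the A-cochain of a sequence consists of the entries whose index has low-order bits 00 or 11 (the level 2 chains $x^{EE}$ and $x^{OO}$) taken in index order, that the B-cochain takes the remaining entries, and that the final layer joins $z^A_i$ and $z^B_i$ into outputs $2i$ and $2i+1$. All function names are hypothetical.

```python
import random
from itertools import product


def balancer(a, b):
    """Quiescent outputs (top, bottom) of a balancer that received a + b tokens."""
    total = a + b
    return (total + 1) // 2, total // 2


def has_step_property(seq):
    return all(0 <= seq[i] - seq[j] <= 1
               for i in range(len(seq)) for j in range(i + 1, len(seq)))


def block(x):
    """Quiescent output counts of Block[len(x)] for input counts x."""
    if len(x) == 2:
        return list(balancer(x[0], x[1]))
    x_a = [v for i, v in enumerate(x) if i % 4 in (0, 3)]   # assumed A-cochain
    x_b = [v for i, v in enumerate(x) if i % 4 in (1, 2)]   # assumed B-cochain
    z_a, z_b = block(x_a), block(x_b)
    y = []
    for a, b in zip(z_a, z_b):
        y.extend(balancer(a, b))        # final layer: outputs 2i and 2i + 1
    return y


def periodic(x):
    """Apply log2(width) Block networks in sequence."""
    for _ in range(len(x).bit_length() - 1):
        x = block(x)
    return x


def periodic_counts(width, samples=300, seed=0):
    """Check the step property on exhaustive (width <= 4) or sampled inputs."""
    rng = random.Random(seed)
    if width <= 4:
        inputs = [list(x) for x in product(range(4), repeat=width)]
    else:
        inputs = [[rng.randrange(10) for _ in range(width)] for _ in range(samples)]
    for x in inputs:
        y = periodic(x)
        if sum(y) != sum(x) or not has_step_property(y):
            return False
    return True
```

Working with token counts rather than individual tokens is a simplification that is only meaningful for quiescent states, which is the setting of Lemma 4.3 and Theorem 4.4.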


5  Implementation  and  Applications 

In a MIMD shared-memory architecture, a balancer can be represented as a record
with two fields: toggle is a boolean value that alternates between 0 and 1, and next is
a 2-element array of pointers to successor balancers. A balancer is a leaf if it has no
successors. A process shepherds a token through the network by executing the procedure
shown in Figure 7. It toggles the balancer's state, and visits the next balancer, halting
when it reaches a leaf. Advancing the toggle state can be accomplished either by a short
critical section guarded by a spin lock (see footnote 4), or by a read-modify-write
operation (rmw for short) if the hardware supports it. Note that all values are bounded.

We illustrate the utility of counting networks by constructing highly concurrent
implementations of three common data structures: shared counters, producer/consumer
buffers, and barriers. In Section 6 we give some experimental evidence that counting
network implementations have higher throughput than conventional implementations
when contention is sufficiently high.

4  A  spin  lock  is  just  a  shared  boolean  flag  that  is  raised  and  lowered  by  at  most  one  processor  at  a 
time,  while  the  other  processors  wait. 
balancer = [toggle: boolean, next: array [0..1] of ptr to balancer]

traverse(b: balancer)
    loop until leaf(b)
        i := rmw(b.toggle := ¬b.toggle)
        b := b.next[i]
    end loop
end traverse


Figure  7:  Code  for  Traversing  a  Balancing  Network 
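The traversal procedure of Figure 7 can be sketched in Python as follows. This is an illustration, not the authors' Mul-T code. It assumes that the read-modify-write returns the toggle's old value (so the first token leaves on next[0]), models leaves as nodes without successors that stand for output wires, and uses a `threading.Lock` in place of a spin lock. The wiring in `build_bitonic_4` is an assumed layout of a width-4 bitonic network, and all names are hypothetical.

```python
import threading


class Balancer:
    """Record with a toggle bit and two successor pointers; a leaf has no successors."""

    def __init__(self, name, successors=None):
        self.name = name
        self.toggle = 0
        self.next = successors            # [top, bottom], or None for a leaf
        self._lock = threading.Lock()     # stands in for the spin lock or rmw

    def is_leaf(self):
        return self.next is None

    def flip(self):
        """Atomically return the current toggle value and complement it."""
        with self._lock:
            i = self.toggle
            self.toggle = 1 - i
            return i


def traverse(b):
    """Shepherd one token from balancer b to a leaf and return that leaf."""
    while not b.is_leaf():
        b = b.next[b.flip()]
    return b


def build_bitonic_4():
    """Assumed width-4 bitonic layout; returns the two entry balancers."""
    outs = [Balancer("out%d" % i) for i in range(4)]
    final_top = Balancer("f01", [outs[0], outs[1]])
    final_bottom = Balancer("f23", [outs[2], outs[3]])
    merge_a = Balancer("A", [final_top, final_bottom])
    merge_b = Balancer("B", [final_top, final_bottom])
    entry_top = Balancer("t01", [merge_a, merge_b])       # input wires 0 and 1
    entry_bottom = Balancer("t23", [merge_b, merge_a])    # input wires 2 and 3
    return entry_top, entry_bottom
```

For example, five tokens entering one after another at the top entry balancer leave on out0, out1, out2, out3, and out0 again.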

5.1  Shared  Counter 

A shared counter [9, 8, 13, 26] is a data structure that issues consecutive integers in
response to increment requests. More formally, in any quiescent state in which $m$ increment
requests have been received, the values 0 to $m - 1$ have been issued in response. To
construct the counter, start with an arbitrary width-$w$ counting network. Associate an
integer cell $c_i$ with the $i$th output wire. Initially, $c_i$ holds the value $i$. A process requests
a number by traversing the counting network. When it exits the network on wire $i$, it
atomically adds $w$ to the value of $c_i$ and returns $c_i$'s previous value.
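The counter construction can be sketched as follows. This is a minimal illustration under assumptions, not the implementation measured in Section 6: the network is given as layers of (top wire, bottom wire) balancers, `BITONIC_4` is an assumed layout of a width-4 bitonic counting network, and Python locks stand in for the atomic toggle and the atomic fetch-and-add on each cell. All class and function names are hypothetical.

```python
import threading

# Assumed layout of a width-4 bitonic counting network (6 balancers, 3 layers).
BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]


class CountingNetwork:
    def __init__(self, layers, width):
        self.width = width
        self.layers = []
        for layer in layers:
            table = {}                    # wire -> balancer shared by its two wires
            for top, bottom in layer:
                bal = {"toggle": 0, "out": (top, bottom), "lock": threading.Lock()}
                table[top] = table[bottom] = bal
            self.layers.append(table)

    def traverse(self, wire):
        """Route one token from input wire `wire`; return its output wire."""
        for table in self.layers:
            bal = table.get(wire)
            if bal is None:
                continue                  # no balancer on this wire in this layer
            with bal["lock"]:
                i = bal["toggle"]
                bal["toggle"] = 1 - i
            wire = bal["out"][i]
        return wire


class SharedCounter:
    def __init__(self, network):
        self.network = network
        self.cells = list(range(network.width))        # c_i initially holds i
        self.cell_locks = [threading.Lock() for _ in self.cells]

    def increment(self, input_wire=0):
        i = self.network.traverse(input_wire)
        with self.cell_locks[i]:                        # atomic add of w to c_i
            value = self.cells[i]
            self.cells[i] += self.network.width
        return value


def concurrent_values(num_threads=4, per_thread=50):
    """Issue increments from several threads and return all values obtained."""
    counter = SharedCounter(CountingNetwork(BITONIC_4, 4))
    results = [[] for _ in range(num_threads)]

    def worker(pid):
        for _ in range(per_thread):
            results[pid].append(counter.increment(pid % 4))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(v for chunk in results for v in chunk)
```

In a sequential run the values come out in order 0, 1, 2, and so on; in a concurrent run they may be issued out of order, but once all requests have completed every value from 0 to $m - 1$ has been issued exactly once.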

Lemmas  2.1  and  2.3  imply  that: 

Lemma  5.1  Let  x  be  the  largest  number  yet  returned  by  any  increment  request  on  the 
counter.  Let  R  be  the  set  of  numbers  less  than  x  which  have  not  been  issued  to  any 
increment  request.  Then 

1.  The  size  of  R  is  no  greater  than  the  number  of  operations  still  in  progress. 

2. If $y \in R$, then $y > x - w|R|$.

3. Each number in $R$ will be returned by some operation in time $\Delta \cdot d + \Delta_c$, where
$d$ is the depth of the network, $\Delta$ is the maximum balancer delay, and $\Delta_c$ is the
maximum time to update a cell on an output wire.

5.2  Producer/Consumer  Buffer 

A producer/consumer buffer is a data structure in which items inserted by a pool of $m$
producer processes are removed by a pool of $m$ consumer processes. The buffer algorithm
used here is essentially that of Gottlieb, Lubachevsky, and Rudolph [13]. The buffer is
a $w$-element array buff[0..w - 1]. There are two $w$-width counting networks, a producer
network, and a consumer network. A producer starts by traversing the producer network,
leaving the network on wire $i$. It then atomically inspects buff[i], and, if it is $\bot$, replaces
it with the produced item. If that position is full, then the producer waits for the item
to be consumed (or returns an exception). Similarly, a consumer traverses the consumer
network, exits on wire $j$, and if buff[j] holds an item, atomically replaces it with $\bot$. If
there is no item to consume, the consumer waits for an item to be produced (or returns
an exception).
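A sequential sketch of this buffer is given below. It is an illustration only and makes several assumptions: it takes the "returns an exception" option rather than waiting, it omits locking (so it is not safe for concurrent use as written), it uses an assumed width-4 bitonic layout for both networks, and it represents $\bot$ by a private sentinel object. All names are hypothetical.

```python
# Assumed layout of a width-4 bitonic counting network (6 balancers, 3 layers).
BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]

_EMPTY = object()        # stands in for the "bottom" marker of an empty slot


class LayeredNetwork:
    def __init__(self, layers):
        self.layers = []
        for layer in layers:
            table = {}
            for top, bottom in layer:
                bal = {"toggle": 0, "out": (top, bottom)}
                table[top] = table[bottom] = bal
            self.layers.append(table)

    def traverse(self, wire):
        for table in self.layers:
            bal = table[wire]
            i = bal["toggle"]
            bal["toggle"] = 1 - i
            wire = bal["out"][i]
        return wire


class Buffer:
    def __init__(self, layers=BITONIC_4, width=4):
        self.producers = LayeredNetwork(layers)
        self.consumers = LayeredNetwork(layers)
        self.buff = [_EMPTY] * width

    def produce(self, item, wire=0):
        i = self.producers.traverse(wire)
        if self.buff[i] is not _EMPTY:
            raise BufferError("slot %d is full" % i)
        self.buff[i] = item

    def consume(self, wire=0):
        j = self.consumers.traverse(wire)
        if self.buff[j] is _EMPTY:
            raise LookupError("slot %d is empty" % j)
        item, self.buff[j] = self.buff[j], _EMPTY
        return item
```

Because both networks hand out wires in the same cyclic order in a sequential run, items are consumed in the order they were produced.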

Lemmas  2.1  and  2.3  imply  that: 

Lemma 5.2 Suppose $m$ producers and $m'$ consumers have entered a producer/consumer
buffer built out of counting networks of depth $d$. Assume that the time to update each
buff[i] once a process has left the counting network is negligible. Then if $m < m'$, every
producer leaves the network in time $2d\Delta$. Similarly, if $m > m'$, every consumer leaves
the network in time $2d\Delta$.


5.3  Barrier  Synchronization 

A barrier is a data structure that ensures that no process advances beyond a particular
point in a computation until all processes have arrived at that point. Barriers are often
used in highly-concurrent numerical computations to divide the work into disjoint phases
with the property that no process executes phase $i$ while another process concurrently
executes phase $i + 1$.

A simple way to construct an $n$-process barrier is by exploiting the following key
observation: Lemma 2.3 implies that as soon as some process exits with value $n$, the
last phase must be complete, since the other $n - 1$ processes must already have entered
the network.

We present a stronger result: one does not need a full counting network to achieve
barrier synchronization. A threshold network of width $w$ is a balancing network with
input sequence $x_i$ and output sequence $y_i$, such that the following holds:

In any quiescent state, $y_{w-1} = m$ if and only if $mw \le \sum x_i < (m + 1)w$.

Informally, a threshold network can “detect” each time $w$ tokens have passed through
it. A counting network is a threshold network, but not vice-versa.

Both the Block[w] network used in the periodic construction and the Merger[w]
network used in the bitonic construction are threshold networks, provided the input
sequence satisfies the following smoothness property:

A sequence $x_0, \ldots, x_{w-1}$ is smooth if for all $i < j$, $|x_i - x_j| \le 1$.

Every sequence with the step property is smooth, but not vice-versa. The following
two lemmas state that smoothness is “stable” under partitioning into subsequences or
application of additional balancers.

Lemma  5.3  Any  subsequence  of  a  smooth  sequence  is  smooth. 

Lemma  5.4  If  the  input  sequence  to  a  balancing  network  is  smooth,  so  is  the  output 
sequence. 

Proof:  It  is  enough  to  observe  that  if  the  inputs  to  a  balancer  differ  by  at  most  one, 
then  so  do  the  outputs.  ■ 

Theorem 5.5 If the input sequence to Block[w] is smooth, then Block[w] is a threshold
network.

Proof: Let $x_i$ be the block's input sequence, $z_i$ the output sequence of nested blocks
A and B, and $y_i$ the block's output sequence.

We first show that if $y_{w-1} = m$, then $mw \le \sum x_i < (m + 1)w$. We argue by induction
on $w$, the block's width. If $w = 2$, the result is immediate. Assume the result for $w = k$
and consider Block[2k] in a quiescent state where $y_{2k-1} = m$. Since $x$ is smooth by
hypothesis, by Lemma 5.4 so are $z$ and $y$. Since $y_{2k-1}$ and $y_{2k-2}$ are outputs of a common
balancer, $y_{2k-2}$ is either $m$ or $m + 1$. The rest is a case analysis.

If $y_{2k-1} = y_{2k-2} = m$, then $z_{2k-1} = z_{2k-2} = m$. By the induction hypothesis and
Lemma 5.3 applied to A and B, $mk \le \sum x^A_i < (m + 1)k$ and $mk \le \sum x^B_i < (m + 1)k$,
and therefore $2mk \le \sum x^A_i + \sum x^B_i < 2(m + 1)k$.

If $y_{2k-2} = m + 1$, then one of $z^A_{k-1}$ and $z^B_{k-1}$ is $m$, and the other is $m + 1$. Without
loss of generality suppose $z^A_{k-1} = m + 1$ and $z^B_{k-1} = m$. By the induction hypothesis,
$(m + 1)k \le \sum x^A_i < (m + 2)k$ and $mk \le \sum x^B_i < (m + 1)k$. Since $x$ is smooth, by
Lemma 5.3 $x^B$ is smooth and some element of $x^B$ must be equal to $m$, which in turn
implies that no element of $x^A$ exceeds $m + 1$. This bound implies that $(m + 1)k = \sum x^A_i$.
It follows that $2mk + k \le \sum x^A_i + \sum x^B_i < 2(m + 1)k$, yielding the desired result.

We now show that if $mw \le \sum x_i < (m + 1)w$, then $y_{w-1} = m$. We again argue by
induction on $w$, the block's width. If $w = 2$, the result is immediate. Assume the result
for $w = k$ and consider Block[2k] in a quiescent state where $2mk \le \sum x_i < 2(m + 1)k$.
Since $x$ is smooth, by Lemma 5.4 $m \le y_{2k-1}$. Furthermore, since $x$ is smooth, by
Lemma 5.3, either $mk \le \sum x^A_i < (m + 1)k$ and $mk \le \sum x^B_i \le (m + 1)k$, or vice versa,
which by the induction hypothesis implies that $z^A_{k-1} + z^B_{k-1} \le 2m + 1$. It follows that
$y_{2k-1} < m + 1$, which completes our claim. ■
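Theorem 5.5 can be spot-checked by brute force on small widths: for every smooth input, the last output of Block[w] should equal $\lfloor \sum x_i / w \rfloor$. The sketch below works on quiescent token counts and assumes the cochain convention in which A receives the entries whose index has low-order bits 00 or 11 and B the rest; the function names are hypothetical and the check covers only the sampled range of inputs.

```python
from itertools import product


def balancer(a, b):
    """Quiescent outputs (top, bottom) of a balancer that received a + b tokens."""
    total = a + b
    return (total + 1) // 2, total // 2


def block(x):
    """Quiescent output counts of Block[len(x)] for input counts x."""
    if len(x) == 2:
        return list(balancer(x[0], x[1]))
    x_a = [v for i, v in enumerate(x) if i % 4 in (0, 3)]   # assumed A-cochain
    x_b = [v for i, v in enumerate(x) if i % 4 in (1, 2)]   # assumed B-cochain
    z_a, z_b = block(x_a), block(x_b)
    y = []
    for a, b in zip(z_a, z_b):
        y.extend(balancer(a, b))
    return y


def is_threshold_on_smooth_inputs(width, max_base=2):
    """Check y[w-1] == floor(sum(x) / w) for all smooth x with entries in {base, base+1}."""
    for base in range(max_base + 1):
        for bumps in product((0, 1), repeat=width):
            x = [base + bump for bump in bumps]
            if block(x)[-1] != sum(x) // width:
                return False
    return True
```

The condition $y_{w-1} = \lfloor \sum x_i / w \rfloor$ is just a restatement of "$y_{w-1} = m$ if and only if $mw \le \sum x_i < (m + 1)w$".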
The proof that the Merger[w] network is also a threshold network if its inputs are
smooth is omitted because it is almost identical to that of Theorem 5.5. A threshold
counter is constructed by associating a local counter $c_i$ with each output wire $i$, just as
in the counter construction.

We construct a barrier for $n$ processes, where $n \equiv 0 \pmod{w}$, using a width-$w$ threshold
counter. The construction is an adaptation of the “sense-reversing” barrier construction
of [14] as follows. Just as for the counter construction, we associate a local counter $c_i$
with each output wire $i$. Let $F$ be a boolean flag, initially false. Let a process's phase
at a given point in the execution of the barrier algorithm be defined as 0 initially, and
incremented by 1 every time the process begins traversing the network. With each phase
the algorithm will associate a sense, a boolean value reflecting the phase's parity: true
for the first phase, false for the second, and so on. As illustrated in Figure 8, the token
for process $P$, after a phase with sense $s$, enters the network on wire $P \bmod w$. If it
emerges with a value not equal to $n - 1 \bmod n$, then it waits until $F$ agrees with $s$ before
starting the next phase. If it emerges with value $n - 1 \bmod n$, it sets $F$ to $s$, and starts
the next phase.

As an aside, we note that a threshold counter implemented from a Block[k] network
can be optimized in several additional ways. For example, it is only necessary to associate
a local counter with wire $w - 1$, and that counter can be modulo $n$ rather than unbounded.
Moreover, all balancers that are not on a path from some input wire to exit wire $w - 1$
can be deleted.

Theorem 5.6 If $P$ exits the network with value $n$ after completing phase $\phi$, then every
other process has completed phase $\phi$, and no process has started phase $\phi + 1$.

Proof: We first observe that the input to Block[w] is smooth, and therefore it is a
threshold network. We argue by induction. When $P$ receives value $v = n$ at the end of
the first phase, exactly $n$ tokens must have entered Block[w], and all processes must
therefore have completed the first phase. Since the boolean $F$ is still false, no process has
started the second phase. Assume the result for phase $\phi$. If $Q$ is the process that received
value $n$ at the end of that phase, then exactly $\phi n$ tokens had entered the network when
$Q$ performed the reset of $F$. If $P$ receives value $v = n$ at the end of phase $\phi + 1$, then
exactly $(\phi + 1)n$ tokens have entered the network, implying that an additional $n$ tokens
have entered, and all $n$ processes have finished the phase. No process will start the next
phase until $F$ is reset. ■
barrier()
    v := exit wire of traverse(wire P mod w)
    if v = n - 1 (mod w)
        then F := s
        else wait until F = s
    end if
    s := ¬s
end barrier

Figure  8:  Barrier  Synchronization  Code 
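The barrier can be sketched with Python threads as shown below. This is an illustration under assumptions, not the measured implementation: it follows the prose description (a process that receives a counter value congruent to $n - 1$ modulo $n$ sets $F$, the others spin), it uses an assumed width-4 bitonic counting network as the threshold network, and it keeps a cell on every output wire rather than only on wire $w - 1$. Locks stand in for atomic operations, and all names are hypothetical.

```python
import threading
import time

# Assumed layout of a width-4 bitonic counting network (6 balancers, 3 layers).
BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]


class ThresholdCounter:
    """Counting network (hence a threshold network) with a cell per output wire."""

    def __init__(self, layers=BITONIC_4, width=4):
        self.width = width
        self.layers = []
        for layer in layers:
            table = {}
            for top, bottom in layer:
                bal = {"toggle": 0, "out": (top, bottom), "lock": threading.Lock()}
                table[top] = table[bottom] = bal
            self.layers.append(table)
        self.cells = list(range(width))
        self.cell_locks = [threading.Lock() for _ in range(width)]

    def next_value(self, wire):
        for table in self.layers:
            bal = table[wire]
            with bal["lock"]:
                i = bal["toggle"]
                bal["toggle"] = 1 - i
            wire = bal["out"][i]
        with self.cell_locks[wire]:
            value = self.cells[wire]
            self.cells[wire] += self.width
        return value


class NetworkBarrier:
    def __init__(self, n, counter):
        if n % counter.width != 0:
            raise ValueError("n must be a multiple of the network width")
        self.n = n
        self.counter = counter
        self.flag = False                    # F, initially false
        self.local = threading.local()       # per-thread sense s

    def wait(self, pid):
        sense = getattr(self.local, "sense", True)   # true for the first phase
        v = self.counter.next_value(pid % self.counter.width)
        if v % self.n == self.n - 1:
            self.flag = sense                # last arrival releases the rest
        else:
            while self.flag != sense:
                time.sleep(0)                # spin, yielding to other threads
        self.local.sense = not sense


def run_phases(n=4, phases=25):
    """Run n threads through the barrier; return any observed violations."""
    barrier = NetworkBarrier(n, ThresholdCounter())
    arrived = [0] * phases
    guard = threading.Lock()
    violations = []

    def worker(pid):
        for phase in range(phases):
            with guard:
                arrived[phase] += 1
            barrier.wait(pid)
            with guard:
                if arrived[phase] != n:      # someone left before all arrived
                    violations.append((pid, phase))

    threads = [threading.Thread(target=worker, args=(p,), daemon=True) for p in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    if any(t.is_alive() for t in threads):
        violations.append("timeout")
    return violations
```

`run_phases` records a violation whenever a thread passes the barrier for a phase before all $n$ threads have arrived at it, so an empty result is consistent with Theorem 5.6 for the runs that were tried.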

6  Performance 

6.1  Overview 

In  this  section,  we  analyze  counting  network  throughput  for  computations  in  which 
tokens  are  eventually  spread  evenly  through  the  network.  To  ensure  that  tokens  are 
evenly  spread  across  the  input  wires,  each  processor  could  be  assigned  a  fixed  input 
wire,  or  processors  could  choose  input  wires  at  random. 

The  network  saturation  S  at  a  given  time  is  defined  to  be  the  ratio  of  the  number 
of  tokens  n  present  in  the  network  (i.e.  the  number  of  processors  shepherding  tokens 
through  it)  to  the  number  of  balancers.  If  tokens  are  spread  evenly  through  the  network, 
then  the  saturation  is  just  the  expected  number  of  tokens  at  each  balancer.  For  the 
Bitonic and Periodic networks, $S = 2n/wd$. The network is oversaturated if $S > 1$,
and undersaturated if $S < 1$.

An oversaturated network represents a full pipeline, hence its throughput is dominated
by the per-balancer contention, not by the network depth. If a balancer with $S$
tokens makes a transition in time $\Delta(S)$, then approximately $w$ tokens emerge from the
network every $\Delta(S)$ time units, yielding a throughput of $w/\Delta(S)$. $\Delta$ is an increasing
function whose exact form depends on the particular architecture, but similar measures
of degradation have been observed in practice to grow linearly [3, 20]. The throughput
of an oversaturated network is therefore maximized by choosing $w$ and $d$ to minimize $S$,
bringing it as close as possible to 1.

The throughput of an undersaturated network is dominated by the network depth,
not by the per-balancer contention, since the network pipeline is partially empty. Every
$O(1/S)$ time units, $w$ tokens leave the network, yielding throughput $O(wS)$. The
throughput of an undersaturated network is therefore maximized by choosing $w$ and $d$
to increase $S$, bringing it as close as possible to 1.
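The saturation formula $S = 2n/wd$ is easy to tabulate. The sketch below assumes depth $\log w(\log w + 1)/2$ for the bitonic network and $\log^2 w$ for the periodic network, with $w/2$ balancers per layer; these depths are consistent with the balancer counts quoted in Section 6.2 (6, 24, and 80 for bitonic widths 4, 8, and 16). The function names are introduced here for illustration.

```python
def bitonic_depth(w):
    """Assumed depth of Bitonic[w]: log w (log w + 1) / 2."""
    k = w.bit_length() - 1            # log2(w) for a power of two
    return k * (k + 1) // 2


def periodic_depth(w):
    """Assumed depth of Periodic[w]: log w blocks of depth log w each."""
    k = w.bit_length() - 1
    return k * k


def balancer_count(w, depth):
    """Each layer of a width-w network holds w / 2 balancers."""
    return w * depth // 2


def saturation(n, w, depth):
    """S = 2n / (w d): tokens in the network divided by the number of balancers."""
    return 2 * n / (w * depth)


if __name__ == "__main__":
    for w in (4, 8, 16):
        d = bitonic_depth(w)
        print(w, d, balancer_count(w, d), saturation(16, w, d))
```

Under these assumptions the width-4 bitonic network reaches $S = 1$ at 6 concurrent processes and the width-4 periodic network at 8, which is where the corresponding curves are reported to stop improving.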
Figure 9: Bitonic Shared Counter Implementations (horizontal axis: concurrency, in number of processes)

This analysis is necessarily approximate, but it is supported by experimental evidence.
In the remainder of this section, we present the results of timing experiments for
several data structures implemented using counting networks. As a control, we compare
these figures to those produced by more conventional implementations using spin locks.
These implementations were done on an Encore Multimax, using Mul-T [16], a parallel
dialect of Lisp. The spin lock is a simple “test-and-test-and-set” loop [21] written in
assembly language, and provided by the Mul-T run-time system. In our implementations,
each balancer is protected by a spin lock.

Figure 10: Periodic Shared Counter Implementations

6.2  The  Shared  Counter 

We compare seven shared counter implementations: bitonic and periodic counting
networks of widths 16, 8, and 4, and a conventional spin lock implementation (which can be
considered a degenerate counting network of width 2). For each network, we measured
the elapsed time necessary for $2^{20}$ (approximately a million) tokens to traverse the
network, controlling the level of concurrency.

For the bitonic network, the width-16 network has 80 balancers, the width-8 network
has 24 balancers, and the width-4 network has 6 balancers. In Figure 9, the horizontal
axis represents the number of processes executing concurrently. When concurrency is 1,
each process runs to completion before the next one starts. The number of concurrent
processes increases until all sixteen processes execute concurrently. The vertical axis
represents the elapsed time (in seconds) until all $2^{20}$ tokens had traversed the network. With
no concurrency, the networks are heavily undersaturated, and the spin lock's throughput
is the highest by far. As saturation increases, however, so does the throughput for each
of the networks. The width-4 network is undersaturated at concurrency levels less than
6. As the level of concurrency increases from 1 to 6, saturation approaches 1, and the
elapsed time decreases. Beyond 6, saturation increases beyond 1, and the elapsed time
eventually starts to grow. The other networks remain undersaturated for the range of
the experiment; their elapsed times continue to decrease. Each of the networks begins
to outperform the spin lock at concurrency levels between 8 and 12. When concurrency
is maximal, all three networks have throughputs at least twice the spin lock's. Notice
that as the level of concurrency increases, the spin lock's performance degrades in an
approximately linear fashion (because of increasing contention).

Figure 11: Producer/Consumer Buffer Implementations

The performance of the periodic network (Figure 10) is similar. The width-4 network
reaches saturation 1 at 8 processes; its throughput then declines slightly as it becomes
oversaturated. The other networks remain undersaturated, and their throughputs continue
to increase. Each of the counting networks outperforms the spin lock at sufficiently
high levels of contention. At 16 processes, the width-4 and width-8 networks have almost
twice the throughput of the single spin-lock implementation. Each bitonic network has
a slightly higher throughput than its periodic counterpart.

6.3  Producer/Consumer  Buffers 

We compare the performance of several producer/consumer buffers implemented using
the algorithm of Gottlieb, Lubachevsky, and Rudolph [13] discussed in Section 5. Each
implementation has 8 producer processes, which continually produce items, and 8 consumer
processes, which continually consume items. If a producer (consumer) process
finds its buffer slot full (empty), it spins until the slot becomes empty (full).

We consider buffers with bitonic and periodic networks of width 2, 4, and 8. As
a final control, we tested a circular buffer protected by a single spin lock, a structure
that permits no concurrency between producers and consumers. Figure 11 shows the
time in seconds needed to produce and consume $2^{20}$ tokens. Not surprisingly, the single
spin-lock implementation is much slower than any of the others. The width-2 network
is heavily oversaturated, the bitonic width-4 network is slightly oversaturated, while the
others are undersaturated.

Figure 12: Barrier Implementations


6.4  Barrier  Synchronization 

Figure 12 shows the time (in seconds) taken by 16 processes to perform $2^{16}$ barrier
synchronizations. The remaining columns show Block[k] networks of width 4, 8, and
16. The last column shows a simple sense-reversing barrier in which the Block network
is replaced by a single counter protected by a spin lock. The three network barriers are
equally fast, and each takes about two-thirds the time of the spin-lock implementation.


7  Verifying  That  a  Network  Counts 

The  “0-1  law”  states  that  a  comparison  network  is  a  sorting  network  if  (and  only  if) 
it  sorts  input  sequences  consisting  entirely  of  zeroes  and  ones,  a  property  that  greatly 
simplifies  the  task  of  reasoning  about  sorting  networks.  In  this  section,  we  present  an 
analogous  result:  a  balancing  network  having  m  balancers  is  a  counting  network  if  (and 
only if) it satisfies the step property for all sequential executions in which up to $2^m$ tokens
have  traversed  the  network.  This  result  simplifies  reasoning  about  counting  networks, 
since  it  is  not  necessary  to  consider  all  concurrent  executions.  However,  as  we  show, 
the  number  of  tokens  passed  through  the  network  in  the  longest  of  these  sequential 
executions  cannot  be  less  than  exponential  in  the  network  depth. 

We  begin  by  proving  that  it  suffices  to  consider  only  sequential  executions. 

Lemma 7.1 Let $s$ be a valid schedule of a given balancing network. Then there exists
a valid sequential schedule $s'$ such that the number of tokens which pass through each
balancer in $s$ and $s'$ is equal.
Proof: Let $s = s_0 \cdot p \cdot q \cdot s_1$, where $s_0$, $s_1$ are sequences of transitions, $p$ and $q$ are
individual transitions involving distinct tokens $P$ and $Q$, and where $\cdot$ is the concatenation
operator. If $p$ and $q$ do not occur at the same balancer, then $s_0 \cdot q \cdot p \cdot s_1$ is a valid
schedule. If $p$ and $q$ do occur at the same balancer, then $s_0 \cdot q \cdot p \cdot s'_1$ is a valid schedule
where $s'_1$ is constructed from $s_1$ by swapping the identities of $P$ and $Q$. In each case we
can swap $p$ and $q$ without changing the preceding sequence of transitions $s_0$ and without
changing the number of tokens that pass through any balancer during the execution.

Now suppose that $s$ is a complete schedule. We will transform it into a sequential
schedule by a process similar to selection sorting. Choose some total ordering of the
tokens in $s$. Split $s$ into $s_0 \cdot t_0$ where $s_0$ is the empty sequence and $t_0 = s$. Now
repeatedly carry out the following procedure which constructs $s_{i+1} \cdot t_{i+1}$ from $s_i \cdot t_i$:
while $t_i$ is nonempty let $p$ be the earliest transition in $t_i$ whose token is ordered as less
than or equal to all tokens in $t_i$. Move $p$ to the beginning of $t_i$ by swapping it with
each earlier transition in $t_i$ as described above, and let $s_{i+1} = s_i \cdot p$ and $t_{i+1}$ be the suffix of
the resulting schedule after $p$. This procedure is easily seen to maintain the following
invariant:

1. After stage $i$, $s_i \cdot t_i$ is a valid schedule in which each balancer passes the same
number of tokens as in $s$.

2. After stage $i$, $s_i$ is sorted by token.

Thus when the procedure terminates, we have a valid sequential schedule $s'$ in which
each balancer passes the same number of tokens as in $s$. ■

Theorem 7.2 A balancing network with $m$ balancers satisfies the step property in all
executions if (and only if) it satisfies it in all sequential executions in which up to $2^m$
tokens traverse the network.

Proof: Since by definition the step property depends only on the number of tokens
that pass through the network's output wires, it follows from Lemma 7.1 that a balancing
network satisfies the step property in all executions if (and only if) it satisfies it in all
sequential executions. It remains to be shown that verifying the step property in all
executions involving at most $2^m$ tokens will suffice.

Consider sequential executions of a balancing network with $m$ balancers. When the
network is quiescent, its state is completely characterized by specifying for each balancer
the output wire to which it will send the next token, yielding a maximum of $2^m$ distinct
quiescent states. In a sequential execution, each time a token traverses the network, it
carries the network from one quiescent state to another. Thus, in any execution, after
at most $2^m$ tokens the network must reenter a previously occupied state. ■
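The finiteness of the quiescent state space suggests a mechanical check for small networks. The sketch below is a variant in the spirit of Theorem 7.2 rather than the procedure stated there: instead of enumerating executions of up to $2^m$ tokens, it explores every reachable pair (toggle setting, number of tokens so far modulo the width) and verifies that the next token always leaves on wire $t \bmod w$, which is what the step property requires after each token of a sequential execution. The layered network encoding, the `BITONIC_4` layout, and the function name are assumptions made for illustration.

```python
from collections import deque

# Assumed layout of a width-4 bitonic counting network (6 balancers, 3 layers).
BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]


def counts_in_all_sequential_executions(layers, width):
    """Explore every reachable quiescent state of a layered balancing network.

    layers is a list of layers, each a list of (top_wire, bottom_wire) balancers.
    Returns True if, from every reachable state, the t-th token (t = 0, 1, ...)
    leaves on wire t mod width whichever input wire it entered on.
    """
    balancers = [pair for layer in layers for pair in layer]
    at = {}                                   # (layer index, wire) -> balancer index
    index = 0
    for depth, layer in enumerate(layers):
        for top, bottom in layer:
            at[(depth, top)] = at[(depth, bottom)] = index
            index += 1

    start = ((0,) * len(balancers), 0)        # initial toggles, zero tokens so far
    seen = {start}
    queue = deque([start])
    while queue:
        toggles, t = queue.popleft()
        for entry in range(width):
            state, wire = list(toggles), entry
            for depth in range(len(layers)):
                b = at.get((depth, wire))
                if b is None:
                    continue                  # wire passes straight through this layer
                wire = balancers[b][state[b]]     # 0 -> top output, 1 -> bottom output
                state[b] ^= 1
            if wire != t:
                return False
            successor = (tuple(state), (t + 1) % width)
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return True
```

The search visits at most $2^m \cdot w$ states, so it is practical only for small networks; by the first part of the proof above, passing it for all sequential executions is enough to conclude the step property for all executions.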
How tight is this bound? We now construct a balancing network that is not a
counting network, yet satisfies the step property for any execution in which the number
of tokens is less than exponential in the network depth.

First, consider the following balancing network Stage[2w]. Take two counting networks
A and B of width $w$ having output wires $a_0$ through $a_{w-1}$ and $b_0$ through $b_{w-1}$
respectively. Add a layer of $w$ balancers such that the $i$-th balancer has inputs $a_i$ and
$b_{w-1-i}$ and outputs $a'_i$ and $b'_{w-1-i}$. The resulting network Stage[2w] is not a counting
network; however, it is easily extended to one by virtue of the following lemma.

Lemma 7.3 For any input to Stage[2w], there exists a permutation $\pi_a$ of the output
sequence $a'_0, \ldots, a'_{w-1}$ and a permutation $\pi_b$ of the output sequence $b'_0, \ldots, b'_{w-1}$ such that
the sequence $\pi_a(a'_0, \ldots, a'_{w-1}) \cdot \pi_b(b'_0, \ldots, b'_{w-1})$ has the step property.

Proof: Let us begin by showing that the total inputs to any two balancers in the last
layer differ by at most 1. Since the sequences $a_0, \ldots, a_{w-1}$ and $b_0, \ldots, b_{w-1}$ have the step
property, there exists a $c^a$ (similarly there exists a $c^b$) such that $a_i = a_0$ if $i \le c^a$ and
$a_i = a_0 - 1$ if $i > c^a$.

Suppose $c^a < w - 1 - c^b$. Then $a_i + b_{w-1-i}$ is $a_0 + (b_0 - 1)$ for $i \le c^a$, $(a_0 - 1) + (b_0 - 1)$
for $c^a < i < w - 1 - c^b$, and $(a_0 - 1) + b_0$ for $i \ge w - 1 - c^b$. A similar analysis shows
that when $c^a \ge w - 1 - c^b$ each $a_i + b_{w-1-i}$ is either $a_0 + b_0$ or $a_0 + b_0 - 1$.

Thus there is always a $k$ such that every balancer in the last layer outputs either
$k$ or $k + 1$ tokens. If $k$ is even, then $b'_i = k/2$ for all $i$ and $a'_i = a_i + b_{w-1-i} - k/2$,
which is either $k/2$ or $k/2 + 1$. One can obtain a sequence with the step property by
setting $\pi_a$ to sort the values in $a'$. If $k$ is odd, then each $a'_i$ is $(k + 1)/2$ and each $b'_i$ is
$a_{w-1-i} + b_i - (k + 1)/2$, which will be either $(k + 1)/2$ or $(k + 1)/2 - 1$. In this case
having $\pi_b$ sort the values in $b'$ produces the desired result. ■

By  Lemma  2.2  it  follows  that 

Corollary  7.4  For  any  m  tokens  input  to  STAGE  [2u>],  a\  =  fm  —  i/2xv\  and 
T.Vo'Vi  =  T!X'\m-i/2wl 

In other words, the total number of tokens that end up on the $a'_0, \ldots, a'_{w-1}$ (respectively
$b'_0, \ldots, b'_{w-1}$) output wires is the same as in a proper counting network.
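
As a small worked instance (the numbers are chosen here for illustration and do not come from the paper), take w = 3, let the A subnetwork emit (2, 2, 1) and the B subnetwork emit (1, 1, 0), so that m = 7. Assuming each balancer sends the ceiling of half its tokens to its $a'$ output, the last layer gives:

```latex
\begin{align*}
a_0 + b_2 &= 2 + 0 = 2 &&\Rightarrow\; a'_0 = 1,\; b'_2 = 1,\\
a_1 + b_1 &= 2 + 1 = 3 &&\Rightarrow\; a'_1 = 2,\; b'_1 = 1,\\
a_2 + b_0 &= 1 + 1 = 2 &&\Rightarrow\; a'_2 = 1,\; b'_0 = 1,\\
\sum_{i=0}^{2} a'_i &= 4 = \sum_{i=0}^{2} \left\lceil \frac{7-i}{6} \right\rceil = 2 + 1 + 1,\\
\sum_{i=0}^{2} b'_i &= 3 = \sum_{i=3}^{5} \left\lceil \frac{7-i}{6} \right\rceil = 1 + 1 + 1.
\end{align*}
```

The individual outputs $a' = (1, 2, 1)$ are out of order, but the totals on each half already match those of a width-6 counting network.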

An immediate consequence of Lemma 7.3 and Theorem 7.2 is that if we pass
the outputs $a'_0, \ldots, a'_{w-1}$ and $b'_0, \ldots, b'_{w-1}$ to two separate balancing networks, each of
which is isomorphic to a sorting network, we will obtain a (not very efficient) counting
network. But we are not interested in getting a working counting network; what we wish
to construct is a balancing network which counts all input sequences up to some bound,
but fails on sequences with more tokens.

We construct such a balancing network (denoted ALMOST[2w]) as follows. Take a
STAGE[2w] network and modify it by picking some x other than 0 or w - 1 and deleting
the final balancer between $a_x$ and $b_{w-1-x}$. Denote this balancing network as $\mathrm{STAGE}_x[2w]$.

Let ALMOST[2w] be the periodic network constructed from k stages, for some k > 0,
each a $\mathrm{STAGE}_x[2w]$ network, the outputs of one stage connected to the inputs of the
next.

Let $A_t$ and $B_t$ be the sums of the number of tokens input to each of the two
subnetworks A and B in the t-th stage of ALMOST[2w]. Let $y = \{y_0, \ldots, y_{2w-1}\}$ be the
sequence given by $y_i = \lceil (A_0 + B_0 - i)/2w \rceil$ (that is, $y_i$ is the number of tokens that
would exit on output wire i if ALMOST[2w] were a counting network). Let $A_\infty = \sum_{i=0}^{w-1} y_i$
and $B_\infty = \sum_{i=w}^{2w-1} y_i$. Note that $A_t + B_t = A_0 + B_0 = A_\infty + B_\infty$ for all t and that by
Lemma 2.2, $\lceil (A_\infty - i)/w \rceil = y_i$ and $\lceil (B_\infty - i)/w \rceil = y_{w+i}$ for all i.

Finally, let the imbalance $\delta_t = A_t - A_\infty = -(B_t - B_\infty)$; this quantity represents
“how far” the network is from balancing the tokens between the A and B subnetworks
in stage t, in other words, how many excess tokens must be moved from the A part of
the network to the B part (clearly, if the quantity is negative then tokens should be
moved from B to A).
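
These definitions translate directly into a few lines of arithmetic. The sketch below is illustrative only. The function names are our own, and it relies on the conservation identity $A_t + B_t = A_0 + B_0$, so the target sequence y can be computed from the token totals at any stage.

```python
def ceil_div(n, d):
    """Integer ceiling of n / d for d > 0 (also correct for negative n)."""
    return -(-n // d)


def target_outputs(a_t, b_t, w):
    """y_i = ceil((A_t + B_t - i) / 2w) for i = 0 .. 2w-1."""
    m = a_t + b_t
    return [ceil_div(m - i, 2 * w) for i in range(2 * w)]


def balanced_totals(a_t, b_t, w):
    """(A_inf, B_inf): tokens a counting network would route to each half."""
    y = target_outputs(a_t, b_t, w)
    return sum(y[:w]), sum(y[w:])


def imbalance(a_t, b_t, w):
    """delta_t = A_t - A_inf; positive means A holds excess tokens."""
    a_inf, _ = balanced_totals(a_t, b_t, w)
    return a_t - a_inf
```

For example, with w = 4, ten tokens entering A and none entering B, the target sequence is (2, 2, 1, 1, 1, 1, 1, 1), so $A_\infty = 6$, $B_\infty = 4$ and the imbalance is 4.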

The  following  lemma  follows  from  arguments  almost  identical  to  those  of  Lemma  5.4. 

Lemma  7.5  If  the  input  sequence  to  a  balancing  network  has  the  step  property,  then  so 
does  the  output  sequence. 

Lemma 7.6 If $\delta_t = 0$ then the output sequences of stage t of ALMOST[2w] have the
step property.

Proof: If $\delta_t = 0$, then $A_t = A_\infty$, so $a_i = \lceil (A_t - i)/w \rceil = \lceil (A_\infty - i)/w \rceil = y_i$ for each
i (Lemma 2.2); similarly $b_i = y_{w+i}$ and thus the outputs of the counting networks form
the sequence y. Since y has the step property it is left unchanged by the final layer of
balancers (Lemma 7.5). ■

Lemma 7.7 $\delta_{t+1} = \lfloor (a_x - b_{w-1-x})/2 \rfloor$.

Proof: If a balancer were placed between $a'_x$ and $b'_{w-1-x}$ after stage t, then the $\mathrm{STAGE}_x[2w]$
network would become a STAGE[2w] network, and by Corollary 7.4, exactly
$A_\infty$ tokens would emerge from the A half of the network after stage t + 1. Therefore,
removing the balancer shifts precisely this number of tokens (possibly negative) from
the B part of the network to the A part. ■
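
The size of the shift can be written out explicitly. The derivation below is a sketch under two assumptions that are ours rather than quoted from the text: a balancer receiving n tokens emits ⌈n/2⌉ on its A-side output, and $a_x$, $b_{w-1-x}$ denote the token counts arriving at the position of the deleted balancer. With the balancer present the A half would emit $A_\infty$ tokens; without it, wire x carries $a_x$ instead of the balanced share:

```latex
\begin{align*}
A_{t+1} &= A_\infty - \left\lceil \frac{a_x + b_{w-1-x}}{2} \right\rceil + a_x, \\
\delta_{t+1} = A_{t+1} - A_\infty
  &= a_x - \left\lceil \frac{a_x + b_{w-1-x}}{2} \right\rceil
   = \left\lfloor \frac{a_x - b_{w-1-x}}{2} \right\rfloor .
\end{align*}
```

The last equality uses $n - \lceil z \rceil = \lfloor n - z \rfloor$ for any integer n. The shift is positive when wire $a_x$ carries more tokens than $b_{w-1-x}$ and negative in the opposite case.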
where $0 < c_1 < 2$ and $0 < c_2 < 1$, which, adding 0 <  < w and $0 < x < w - 1$,
implies $-1 < c < 5/2$. ■

Lemma 7.9 If $\delta_t \ne 0$ then $|\delta_{t+1}| \le |\delta_t| - 1$.

Proof: It is clear from Lemma 7.7 that $\delta_{t+1}$ and $\delta_t$ must have the same sign; thus we
need only show that $\delta$ increases when it is negative and decreases when it is positive. By
definition, if $\delta_t \ne 0$ the sequence $a_0, \ldots, a_{w-1}, b_0, \ldots, b_{w-1}$ (recall these are the outputs
of the A and B parts in the (t+1)-th $\mathrm{STAGE}_x$ before the last layer of balancers) does not
have the step property. Since each of the sequences $a_0, \ldots, a_{w-1}$ and $b_0, \ldots, b_{w-1}$ in itself
has the step property, the step property of $a_0, \ldots, a_{w-1}, b_0, \ldots, b_{w-1}$ must be violated in
one of the following two ways by some $a_i$ and $b_j$.

1. $a_i < b_j$. Then $\delta_t < 0$ (or else $y_i \le a_i < b_j \le y_{w+j}$, which contradicts the step
property of y). Furthermore, since $a_i \ge a_{w-1}$ and $b_0 \ge b_j$ it follows that $b_0 > a_{w-1}$,
and at least one token is moved from B to A by the balancer between those two
outputs, increasing $\delta$.

2. $a_i \ge b_j + 2$. Then $\delta_t > 0$; furthermore $a_0 \ge a_i \ge b_j + 2 \ge b_{w-1} + 2$; thus the
balancer between $a_0$ and $b_{w-1}$ will move at least one token from A to B, reducing
$\delta$.
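
The behaviour of the imbalance can be observed in a small quiescent-state simulation. This is a sketch under stated assumptions, not the authors' code: the subnetworks A and B are modelled as ideal width-w counting networks (output i carries ⌈(m - i)/w⌉ of the m tokens entering), each balancer sends the ceiling of half its tokens to its A-side output, and the balancer at position x is omitted. All names are illustrative.

```python
def ceil_div(n, d):
    """Integer ceiling of n / d for d > 0."""
    return -(-n // d)


def stage_x(a_in, b_in, w, x):
    """One STAGE_x[2w]: map token totals entering A and B to the totals leaving them."""
    a = [ceil_div(a_in - i, w) for i in range(w)]   # outputs of counting network A
    b = [ceil_div(b_in - j, w) for j in range(w)]   # outputs of counting network B
    a_next = b_next = 0
    for i in range(w):
        top, bottom = a[i], b[w - 1 - i]
        if i != x:                                  # balancer x has been deleted
            received = top + bottom
            top, bottom = (received + 1) // 2, received // 2
        a_next += top
        b_next += bottom
    return a_next, b_next


def imbalance(a_t, b_t, w):
    """delta_t = A_t - A_inf, where A_inf sums the first w target outputs y_i."""
    m = a_t + b_t
    a_inf = sum(ceil_div(m - i, 2 * w) for i in range(w))
    return a_t - a_inf


def imbalance_trace(a0, b0, w, x, stages):
    """Return [delta_0, delta_1, ..., delta_stages] for the periodic network."""
    trace = [imbalance(a0, b0, w)]
    a_t, b_t = a0, b0
    for _ in range(stages):
        a_t, b_t = stage_x(a_t, b_t, w, x)
        trace.append(imbalance(a_t, b_t, w))
    return trace
```

With w = 4, x = 1, ten tokens entering A and none entering B, `imbalance_trace(10, 0, 4, 1, 3)` gives `[4, 1, 0, 0]`: the imbalance keeps its sign, shrinks by at least one per stage, and stays at zero once it reaches zero.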


Theorem 7.10 There exists a width-2w balancing network that has the step property in
all executions with up to $w^{k-4}$ tokens, yet is not a counting network.

Proof: Lemmas 7.6 and 7.9 together imply that if $|\delta_t| \le s$, then the outputs of stage
t + s will have the step property. We may conclude from Lemma 7.8 that $|\delta_{t+1}| <
|\delta_t|/w + 5/2$. Solving this recurrence yields the upper bound $|\delta_t| < |\delta_0| w^{-t} + (5/2) \cdot \frac{w}{w-1}$.

Now suppose the network is given an input involving at most $w^t$ tokens. Then $|\delta_0|$
cannot possibly exceed $w^t$, and after t stages $|\delta_t| < 1 + (5/2) \cdot \frac{w}{w-1} < 5$; since $|\delta_t|$ must
be an integer, it follows that $|\delta_t| \le 4$. Thus the outputs of stage t + 4 will have the step
property, and a network with k = t + 4 stages will count up to $w^{k-4}$ tokens.

To see that this k-stage network is not a counting network, suppose $|\delta_0| > 5w^{k+1}$.
From Lemma 7.8 it follows that $|\delta_{t+1}| > |\delta_t|/w - 5/2$, and solving as above yields
$|\delta_{k+1}| > |\delta_0| w^{-(k+1)} - (5/2) \cdot \frac{w}{w-1} > 1$. Since $\delta_{k+1} \ne 0$, the outputs of stage k (and
hence the entire network) cannot have the step property. ■
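
The step of solving the recurrence can be spelled out as follows. This is one way to carry it out, assuming only the inequality $|\delta_{t+1}| < |\delta_t|/w + 5/2$ and bounding the resulting geometric sum by its infinite series:

```latex
\begin{align*}
|\delta_t| &< \frac{|\delta_{t-1}|}{w} + \frac{5}{2}
            < \frac{|\delta_{t-2}|}{w^2} + \frac{5}{2}\left(1 + \frac{1}{w}\right)
            < \cdots \\
           &< |\delta_0|\, w^{-t} + \frac{5}{2} \sum_{i=0}^{t-1} w^{-i}
            < |\delta_0|\, w^{-t} + \frac{5}{2} \cdot \frac{w}{w-1}.
\end{align*}
```

Because x must differ from both 0 and w - 1, the construction requires $w \ge 3$, so $w/(w-1) \le 3/2$ and the additive term is at most 15/4. With $|\delta_0| \le w^t$ this gives $|\delta_t| < 1 + 15/4 < 5$. The same unrolling applied to the lower-bound recurrence with $|\delta_0| > 5w^{k+1}$ gives $|\delta_{k+1}| > 5 - 15/4 > 1$.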


8  Discussion 

Counting  networks  deserve  further  study.  We  believe  that  they  represent  a  start  toward 
a  general  theory  of  low-contention  data  structures.  Work  is  needed  to  develop  other 
primitives,  to  derive  upper  and  lower  bounds  and  new  performance  measures.  We  have 
made  a  start  in  this  direction  by  deriving  constructions  and  lower  bounds  for  linearizable 
counting  networks  [15],  networks  which  guarantee  that  the  values  assigned  to  tokens 
reflect  the  real-time  order  of  their  traversals.  Work  is  also  needed  in  experimental 
directions,  comparing  counting  networks  to  other  techniques,  for  example  those  based 
on  exponential  backoff  [1],  and  for  understanding  their  behavior  in  architectures  other 
than  the  single-bus  architecture  provided  by  the  Encore. 

We close by raising an open question: does there exist an $O(\log n)$-depth counting
network? From Theorem 7.2, one can easily show that “smoothing + sorting =
counting,” that is, given a balancing network which smoothes its output sequence (see
Section 5.3), and a balancing network isomorphic to any sorting network, the balancing
network constructed by joining the output wires of the first to the input wires of the
second is a counting network (Karchmer and Klugerman [18] have recently used this
observation to construct an $O(\log n \log\log n)$ depth counting network based on [2]). Since
it is known that there exists an $O(\log n)$-depth sorting network [2], it follows that there
exists an $O(\log n)$-depth counting network if and only if there exists an $O(\log n)$-depth
smoothing network. 5